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1. Introduction and Motivation 


There are many applications of eigensolvers in Lattice QCD: E.g., many physical properties 
are encoded in the spectrum of the Dirac operator, D, and the lowest eigenvalues and eigenvectors 
of D^D can be used in low-mode averaging to reduce the noise of stochastically estimated quantities 
like disconnected fermion loops. However, the calculation of the lowest eigenmodes of the Dirac 
operator can be costly and scales with V where V is the lattice four-volume and Neig is the 
number of the lowest eigenmodes and often Ngig V. There are two possible ways to alleviate this 
problem and make eigenmodes affordable for the use in low-mode averaging and other applications 
[1-5]. One of them is the development of more efficient solvers, the other is to relax the precision 
of the eigenmodes. In this work we pursue both paths. 

Several adaptive algebraic multigrid methods have been proposed in recent years as linear sys¬ 
tem solvers for the non-Hermitian formulation; cf. [6-10]. In particular, we proposed an adaptive 
aggregation based domain decomposition multigrid (“DD-aAMG”) method to solve linear systems 
with the non-Hermitian Wilson Dirac operator D and observed large speed-ups over conventional 
Krylov subspace methods. In what follows we present a modification of this method that allows us 
to also solve systems with the Hermitian version of the Dirac operator Q = y^D. In order to use 
this linear systems solver to calculate eigenmodes of Q we employ a standard Rayleigh quotient 
iteration (see [11]), where shifted systems Q — o need to be inverted. In this process, the current 
eigenvector approximations are built into the interpolation in each iteration, which allows us to 
view the eigensolver also as a setup procedure for the multigrid method itself, and thus enables 
the calculation of eigenvectors corresponding to small eigenvalues to any desired accuracy. We 
compare a MPI-C implementation of our eigensolver with PARPACK and show that speed-ups of 
roughly an order of magnitude can be achieved. 

Subsequently, we use this method to obtain both high and low accuracy eigenmodes of the 
Hermitian Dirac operator and use these in low-mode averaging which we apply to the pion- and 
eta-correlators. By constructing an improved estimate for approximate low-mode contributions we 
are able to benefit even more from the faster calculation when relaxing the target residual. The 
introduction of a cutoff enables us to use the test vectors from the standard DD-aAMG setup [9] 
without further iteration on the eigenvectors. By combining these two approaches, we obtain final 
sfafisfical errors which are of roughly fhe same magnifude as fhose obfained when using exacf 
eigenmodes buf af a much smaller fofal cosf. 

The sfrucfure of fhis work is as follows: In Sec. 2 we describe our mulfigrid eigensolver 
algorifhm for Q using Rayleigh quofienf iferafion and subsequenfly compare fhe performance fo 
PARPACK in Sec. 2.2. In Sec. 3 we apply fhis mefhod fo low-mode averaging for fhe efa- and pion- 
correlafor in fhe Iwo-flavour sysfem. Afler a short infroducfion fo fhe main fechniques, namely low¬ 
mode averaging and sfochaslic esfimafion, in Sec. 3.1 we presenf improvemenf fechniques for fhe 
inexacf eigenmodes (Sec. 3.2), and in Sec. 3.3 we devise a criferion fo resfricf fhe sef of eigenmodes 
so fhaf we can use fhe fesf vecfors of fhe mulfigrid sefup direcfly. Finally, we compare errors and 
fhe achieved speed-ups in Sec. 3.4 before we conclude in Sec. 4. 
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2. Algebraic Multigrid in Rayleigh Quotient Iteration 


Let D be the non-Hermitian (Clover-improved) Wilson Dirac operator and Q := 75 D its Her- 

mitian version (75 = — 71727374 ) and assume that we choose a representation of the 7 such that 
^ 


75 = 


-1 


I. An eigenvector | v) / 0 of 2 with corresponding eigenvalue A satisfies 


Q\v)=X 


( 2 . 1 ) 


If A is small in modulus, we call | v) a small eigenvector or low-mode. In [9, 10] we proposed an 
adaptive algebraic domain decomposition method termed “DD-aAMG” for solving linear systems 




( 2 . 2 ) 


The error propagator for the two-level version of our method is - as for many other two-level 
approaches - of the generic form 

E2g = (1 -MD)''(1 -PD^^RD){1 -MDf, (2.3) 

where M denotes the smoother which is given by the Schwarz alternating procedure (SAP), jx and 
V number the pre- and post-smoothing iterations, respectively. P denotes the adaptively constructed 
aggregation based interpolation / prolongation [6-10] and R the corresponding restriction. In order 
to preserve the 75 -hermiticity of D in the coarse grid system := RDP, as in the other approaches 

^ in accordance to the ordering of spins in 75 , so that P fulfils 

Y5P = Pt5, (2.4) 

where is the coarse grid analogue of 75 . Thus, further choosing R = P^ we obtain a symmetric 
coarse operator that fulfils 

= P^j,DP = P^D^Pf, = Dlf, = . (2.5) 

This algebraic multigrid approach for D can then be easily transferred to one for Q. In fact it 
has been shown already in [ 8 ] that using this construction for P and R the coarse grid corrections 
for D and Q are identical if one chooses Qc := P^QP, i.e., 

1 = 1 (2.6) 

The remaining part is to find a smoother for Q and define a way to solve systems with the coarse 
grid system Qc. Numerical experiments show that SAP is not suitable as a smoother for Q. Since 
full GMRES is known to converge for any linear system, restarted GMRES in practice may work 
as a smoother for Q for suitably chosen restart lengths. At the same time, GMRES is one of 
the most numerically stable Krylov subspace methods for indefinite systems and is thus to be 
expected to work well as a solver for Qc. Therefore, we supply (restarted) GMRES as smoother 
and coarse grid solver which makes our algebraic multigrid solver for Q similar in spirit to the one 
proposed in [ 6 - 8 ] for D. Due to the fact that the multigrid method is adaptively constructed to 


just cited, we choose P = 
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Algorithm 1: Rayleigh Quotient Iteration + algebraic multigrid 


input: number of eigenvectors Neig, desired accuracy e 
output; eigenvectors | vi| 

1 let I viI be orthonormalised random vectors and A,- = 0, £, = 1 V/ = 1,... ,Neig 

2 build P from | vi| 

3 while Bsi : Ei> e do 

4 for all / = 1,. .. ,Neig with e,- > £ do 
a ^ A; • max(l — £,-, 0 ) 

I Vi) < 

vi)^\vi)-riJ 

I Vi) ^1 V,-)/||| v,')|| 

update Vi in interpolation P 

h = {vi I Q I V/) 

£/ = 112 I Vi)-?ii I V;)|| 


iQ-o)-^\vi) 

i) ^\Vi)-L‘j=\iiVj\Vi)) \Vj) 


efficiently treat the low modes of Q on the coarse grid, it should also do so for 2 — a as long as 
a (in modulus) is sufficiently small. This then allows to construct shift-invert eigensolvers where 
algebraic multigrid accelerates the eigensolver. The simpler shift-invert approaches supply shifts 
a close to an eigenvalue A such that Q — G becomes very ill-conditioned. It has been shown in [6- 
10] that algebraic multigrid methods for D are less sensitive to the condition number than Krylov 
subspace methods. Since this also holds for algebraic multigrid for Q, the coarse grid corrections 
being equivalent, it appears attractive to use algebraic multigrid for Q in an eigensolver setting. 

2.1 Description of the Algorithm 

A standard shift-invert eigensolver approach is the Rayleigh quotient iteration (RQI, see, 
e.g., [11]). We now describe RQI that uses our algebraic multigrid method for the inversion of 
the shifted systems in detail. We initially choose a set of Neig orthonormalised random vectors 
I vi),..., I vn^.^) and corresponding eigenvalue guesses Ai = ... = A/v = 0. Using the multigrid 
method each | v;) is updated via one shifted inversion | v;) (2~‘^i) ^ I Vi). Subsequently, 

the vectors | vi),..., | are re-orthonormalised and the eigenvalue guesses A,- are updated as 
h = {vi I 2 I Vi). This process is iterated until the norm of the eigenvector residual \\Q\vi) — X \ v,) || 
is smaller than a given tolerance £. As opposed to standard RQI, the solver that is used to solve the 
shifted linear systems is updated in every iteration by re-building the interpolation operator P from 
the recent eigenvector estimates | vi),..., | 

The described procedure is summarised in Alg. 1 . In practice we do not start with entirely ran¬ 
dom vectors | vi),..., | vjg^.^) but with the test vectors generated in the setup phase which constructs 
the interpolation P used in DD-aAMG. This setup procedure, described in [9], applies a small 
number of smoother iterations to a set of random vectors to form a first interpolation P. Then, 
one iteration of algebraic multigrid is applied to each test vector while keeping them orthonormal. 
This procedure is repeated a few times until one is satisfied with the convergence of the algebraic 
multigrid solver. 
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Figure 1: Comparison of PARPACK and RQI+AMG, core hours needed to compute the Neig = 20 smallest 
eigenvectors of Q for lattice sizes ranging from 48 x 24^ to 64"*^ at constant physics. 

In practice we observed that sometimes the Ngtg computed eigenpairs (A,',| v;)) are not the 
Neig smallest pairs. Usually, the eigenpairs with A closest to 0 are always met, but some of the 
remaining smallest eigenvalues might be missed. This can happen if the starting guess has only 
very little overlap with the direction of the desired eigenvector or if the estimate for the current 
eigenvalue is too large. In order to reduce the frequency of such events, we introduce an accuracy 
dependent damping mechanism in line 5 that restricts the magnitude of the shifts used in the shifted 
inversion. 

2.2 Scaling and Comparison with Other Algorithms 

In this section we give preliminary results for our Rayleigh quotient iteration with multigrid 
algorithm (RQI+AMG, Algorithm 1) that we implemented within our existing DD-aAMG frame¬ 
work based on the programming language C and the parallelisation interface MPI. The Rayleigh 
quotient iteration is performed in double precision and each inversion is computed by a double pre¬ 
cision flexible GMRES solver preconditioned with single precision algebraic multigrid. All results 
that we state in this section were obtained on the Juropa machine at Jiilich Supercomputing Centre, 
a cluster with 2,208 compute nodes, each with two Intel Xeon X5570 (Nehalem-EP) quad-core pro¬ 
cessors. This machine provides a maximum of 8,192 cores for a single job. The code is compiled 
with the icc-compiler using the optimisation flags -03, -ipo, -axSSE4.2 and -m64. 

In Eig. 1 we compare the amount of core hours needed to compute the Neig = 20 smallest 
eigenvectors of Q with RQI+AMG and with PARPACK [12]. The latter is a publicly available 
parallel Arnold! type eigensolver which is widely used in the lattice QCD community. It builds a 
Krylov subspace of a chosen dimension and estimates Neig eigenvector approximations in this 
subspace. Thereafter the procedure is restarted, keeping the Neig approximations and improving 
them within a new subspace which consists of the Neig approximate eigenvectors and Nkv — Neig 
new vectors coming from a new Arnold! iteration. In Pig. 1 we used Nkv = 100. We observe 
that RQI+AMG outperforms PARPACK by almost an order of magnitude already on rather small 
configurations of size 48 x 24^. Por a lattice volume of 64 x 40*^ PARPACK already exceeded the 
24 hours job limit with 1024 cores. The curves displayed in Pig. 1 also hints at that RQI+AMG 
scales better with the lattice size than PARPACK does. This is a major advantage for today’s 
large volume simulations. Note that all configurations are at constant physics, i.e., they are all two 
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Figure 2: Comparison of PARPACK and RQI+AMG, core hours as a function of the number of smallest 
eigenvectors Neig for a configuration size of 48 x 24^ 


flavour simulations at ntg: « 290MeV and a lattice spacing a ss 0.071 fm (more details on the used 
configurations can be found in [13]). 

However, the situation is less in favour of RQI+AMG when we investigate the scaling with the 
number of desired eigenvectors Neig. In Fig- 2 we compare the scaling with Ngig for RQI+AMG and 
PARPACK. For PARPACK we invariably used Nkv = 200 given that in our numerical experiments 
we did not see any significant difference in runtime when keeping Nkv constant instead of taking 
Nkv proportional to Neig. Note that we always had at least Neig < We observe that the runtime 
of RQI+AMG grows rapidly as the number of eigenvectors Neig is increased whereas PARPACK 
does not show any distinct dependence on Neig. 

Due to the orthonormalisation process in the Arnoldi procedure, PARPACK is expected to 
scale with ^{N^ig). However for small numbers of eigenvectors Neig, the number of restarts 
in PARPACK predominates the overall computations rather than the orthonormalisation process. 
Since we build all Neig current eigenvector approximations into the interpolation P of algebraic 
multigrid, the corresponding coarse operator Qc = P^QP has complexity ^{N^ig), since each coarse 
lattice site holds 2Neig variables which couple with each neighbouring coarse lattice site via a non- 
sparse INeig x 2Neig coupling matrix. Solving the coarse grid system is thus expected to scale at 
least as In future work we plan to investigate an eigensolver approach wherein we do not 

need to build P from all Neig current eigenvector approximations. 

Finally, in Fig. 3 we track the dependence of PARPACK and RQI+AMG on spectral fluctua¬ 
tions for eight statistically independent configurations for two different lattice volumes. We observe 
only minor fluctuations in core hours. 

3. Inexact Eigenmodes and Physics Application 

We now use the above described eigensolver for low-mode averaging which is employed to 
improve the statistical signal of connected [2, 3] and disconnected [1, 4, 5, 14] contributions to 
hadronic observables. In this work, we apply this method to pion- and eta-meson correlators. 

Such noise reduction techniques are particularly important for quark line disconnected con¬ 
tributions. These arise, e.g., when calculating fermionic n-point functions of flavour-singlet quan¬ 
tities. For the eta-meson in the two-flavour {uf = 2) theory, for example, an interpolator is given 


6 






LMA with a new MG Eigensolver 


M. Rottmann, J. Simeth 



Figure 3: Comparison of PARPACK and RQI+AMG, two ensembles at different lattice sizes, 8 statistically 
independent configurations each. 


by 




1 

7! 


(^tixYsttx dxYsdx^. 


(3.1) 


Performing the Wick contractions to obtain the two-point function gives, in the case of mass- 
degenerate quarks, 


C^{x,y)={0^,0]) 

tr (^7 75 ) - «/ tr (^7 Ys) tr 75 ), (3-2) 

where the first term is the connected part that can be calculated cheaply on a single source point yo 
by using 75 -hermiticity and translational invariance, 

tr {D-jj5D;^],Y5) = tr (D-;(d4)1) , (3.3) 


where the trace is over spin and colour indices. For the disconnected contribution, however, the 
propagator starts and ends at the same spacetime point and the calculation of the “loop” D^^Ys 
would require the inversion of the full matrix. In most cases this is computationally unrealistic due 
to the size of D. Instead, one usually employs stochastic techniques for that (see [14] and references 
therein), i.e., one calculates 


2 y^stoch 

Q-i =D ^Y5 = £ I Si){r]i I 

for a sufficiently large number Nstock of stochastic solutions | Si) of the linear system 

Q I Si) =1 ^i), 

where | T], ) is a random noise vector with the properties 


^stock 


^Stock 


£ I T],-) = G{}lsfYiY;YYh) and 


^Stock 


^stoc h 

£ I |=l + ^(Vv^). 


(3.4) 


(3.5) 


(3.6) 


A common choice, which we also use, is to draw the elements of | T],) from Z 2 -|- /Z 2 noise. 
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As is clear from Eq. (3.4), this introduces additional stochastic noise for any finite number of 
estimates which adds to the gauge noise. In other words, Nstoch must be chosen large enough so 
that the gauge noise dominates in the overall statistical error. This requires additional solves and 
becomes more expensive when going down to physical pion masses and large volumes, even with 
modern, e.g., multigrid based solvers. 

To reduce the stochastic noise one can make use of various noise reduction techniques like 
partitioning [5, 15, 16], the truncated solver method [14] or low-mode averaging (for this case 
known as truncated eigenmode acceleration) [4, 5], to name only a few. Which combination of 
these methods works best will in general not only depend on the efficiency of fhe solver buf also 
on fhe observable under considerafion. The efa-correlafor is known fo be low-mode dominafed [1], 
fherefore, if is fhe ideal quantify fo fesf our new eigensolver and invesfigafe fhe use of approximafe 
eigenpairs in low-mode averaging. 

3.1 Low-Mode Averaging 

The basic idea of Low-Mode Averaging (LMA) is fo splif fhe operafor, e.g., fhe Hermifian 
Dirac Operafor Q = j^D, in fwo parfs: 


Q-^=QuL + QHik^ ( 3 - 7 ) 

where confains fhe fhe confribufions fo Q ^ from fhe Neig lowesf eigenmodes: 

Neig . 

= (3-8) 

;• A, 


Qhigh remaining parf of Q ^. 

Lor fhe efa-correlafor (cf. Eq. (3.2)), low-mode averaging works as follows: We need fo calcu- 
lafe bofh fhe connecfed (pion) correlafor Ccon{x,y) = Qyx ) •^he disconnecfed confribufion 

Cdis{x,y) = tr{Q^l)tr{Qyy). LMA can be used for bofh ferms: The connecfed ferm, averaged over 
fhe spafial volume and only depending on fhe Euclidean fime disfance t, reads 


c„„(<)=c;z(<)+cS'(<)=qr{<)+(ceio . 

where wifh x = (x,to + 0> 3^ = (yTo); yo = (yoTo) the individual ferms are given as 

cS:(0=f Etr[(e,;|,)«(e,;iy. 

=Etr [(2, 

X 




(3.9) 


(3.10) 

(3.11) 

(3.12) 


The splitting is performed in a sfraighf-forward way: Lirsf, we calculafe fhe low-mode confribufion 
which uses fhe full (all-fo-all) informafion confained in fhe eigenmodes. The high-mode 
correcfion (fhe terms in the brackets of Eq. (3.9)) is calculated from the exact point-to-all twopoint 
function Ccon and obtained from the eigenmodes at point yo- 
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For the disconnected terms, we correlate two loops at times to and to + t, 

CdUt) = ^'£L{to + t)L{to), (3.13) 

to 

where low-mode substitution is applied to the calculation of the individual loops: 

L(t) =£tr [e-^] (3.14) 

X 

Again, we calculate the low-mode contribution using Eq. (3.8), 

L''’'^(t)=£tr[((2riU. (3-15) 

X 

To remove the high modes from Q, we use the orthonormal projector 

Neig 

^ = t-Y,\vi){vi\, (3.16) 

i 

and estimate 

L^>sh{t) = £tr 

X 

stochastically. Note that = in this case. 

If the low-modes dominate, as is the case for the eta-correlator, less estimates in the calculation 
of are required to achieve a given accuracy. However, the overall cost can be dominated by 
the calculation of the eigenmodes. Therefore, low-mode averaging is often only cost-effective if 
the eigenmodes can be reused many times. 


i^Q) 


(3.17) 


3.2 Approximate Low-Mode Averaging 

Besides the development of faster eigensolvers, it is also possible to reduce the cost of LMA 
by relaxing the eigenmode tolerance in the eigensolver and then correct for this reduced accuracy. 
We denote the /-th approximate eigenpair of Q as (A,-, | v,)) with the eigenmode accuracy 


£i = 


Q I Vi) - Xi I Vi 


and assume that the | v,) are orthonormalised. Further, we define a matrix 


(3.18) 


Aij = {vi\Q\vj), (3.19) 

which we will use to account for the inexactness of the eigenmodes: With | 5v,) denoting the error 
of the approximate eigenvector we have 


V;) =1 Vi)+ I 5vi). 


(3.20) 


Writing A in these terms 

Aij = Xj5ij + Xi{vi I 5vj)+Xj{5vi I Vj) + {5vi \ Q \ 5vj), 


(3.21) 
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Figure 4; The lowest 30 eigenvalues of Q on each of 64 configurations on a 40^ x 64 volume at « 
290MeV. The bottom section shows the exact (£ < 10^^) spectrum. The upper plot relates the eigenvalues 
taken directly after the Multigrid setup with their relative accuracies ^. 


shows that A would be diagonal if the eigenmodes were exact, i.e., if all 5v,- = 0. Using the inverse 
of A instead of the inverse eigenvalues in Eq. (3.8) the expression for the low-mode part of the 
Hermitian propagator generalises to 


l\eig 

QTow = £ I U)(V; 

ij 


(3.22) 


Using this corrected inverse for LMA amounts to replacing by in Eqs. (3.10) to (3.12) 
and Eq. (3.15). Note that in the exact case Eq. (3.22) is identical to Eq. (3.8). The high mode part 
is now obtained by replacing ^ in Eq. (3.8) by an oblique projection 

Neig 

■^ = l-£2|v/)(A-i),7(vy|. (3.23) 

iJ 


Taking A^^ instead of the inverse eigenvalues improves the estimate for by combining 
information from the full space spanned by the inexact eigenvectors. Note that tr = llNeig- 

When combining Eq. (3.22) and Eq. (3.23) together either with Eq. (3.17) or Eq. (3.9) we still obtain 
unbiased results. 

3.3 Use of Multigrid Test Vectors in LMA 

One can go even one step further: Studying the multigrid approach, we observe that the test 
vectors can already be quite accurate eigenvectors directly after our initial multigrid setup phase. 
In the ensemble considered here, the precision after 30 setup iterations is of the order e « 10^^ and 
it turns out that we can indeed use the test vectors as approximate eigenmodes for LMA. 

At the beginning the multigrid test vectors are initialised to random vectors and, since the 
ensemble average of (r | 2 | r) vanishes (where | r) is a random vector), this has the consequence, 
that some of the initial eigenvalues are systematically underestimated and the resulting approximate 
eigenmodes do not respect the mass gap. Including these will obviously affect the quality of 
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C 

Figure 5; The connected pseudoscalar two-point correlator at f = Ni “/2 using only the eigenmodes that have 
a relative error smaller than c. The boxes show the contribution coming only from the inexact low-modes, 
whereas the circles show the corrected two-point function. For comparison, the red bar shows the value that 
can be obtained when using 20 exact eigenmodes with e < 10^^ and the yellow bar marks the conventional 
point-to-all result. 


Luckily, such deviations can be detected by studying the relative precision £/|l| of the eigen¬ 
modes, see Fig. 4. It turns out that the eigenvalues not respecting the mass gap are the ones with 
large ^/\x\. By imposing a cutoff, we restrict the set of eigenpairs used in our computation. 




(A,|v),£) 



(3.24) 


Any normalisation cancels from the above ratio making this choice of cutoff independent of lattice 
volume and pion mass. Once a reasonable cutoff value c has been found, it can be used on different 
ensembles. 

An ideal benchmark quantity for selecting the cutoff is the low-mode contribution to the con¬ 
nected two-point function: It can be calculated without stochastic estimation, i.e., it has gauge 
fluctuations only and the calculation is cheap. Fig. 5 shows how C[‘’^ varies with the cutoff. The 
depicted data are the values at the central timeslice where the relative contributions of the low- 
modes are largest and therefore the effect of using inexact modes can be detected most easily. If c 
is chosen too big we encounter large errors both in and clol! due to the weight given to some 
irrelevant directions by underestimated eigenvalues. In contrast, if c is small, we will not benefit 
from LMA. In any case, after adding the high-mode correction (cf. Eq. (3.9)), within errors we 
always obtain the correct result, as shown by the horizontal bands for the point-to-all and the LMA 
case. We find that a cutoff of c = 0.75 is a good compromise. 

Again we stress that, although the low-mode part is affected by both the number and the 
accuracy of the eigenmodes, the full, high-mode corrected quantity is stable and unbiased. Varying 
the cutoff empirically demonstrates the validity of this statement. 


3.4 Results 

For a first practical test, we use the same ensemble as in the previous sections: A moderately 
large volume of V = 40^ x 64 with two sea quark flavours generated by QCDSF at a pion mass of 
niji K, 290MeV {Lniji « 4.19) and an inverse coupling j8 = 5.29 (corresponding to a lattice spacing 
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a Ri 0.071 fm), see, e.g., [13] for the simulation details. This allows us to explore the methods at 
moderate costs but under real-world conditions. To reduce excited states contributions we use 400 
steps of Wuppertal smearing [17] with smearing parameter 5 = 0.25 for the quark sources and sinks 
and also for the eigenvectors. The gauge field used within the quark smearing is APE-smeared [18] 
with weight factor a = 0.25. 

We perform our calculations on 64 independent configurations. On each of these we compute 
30 inexact eigenmodes, just performing the multigrid setup with 30 steps and no further Rayleigh 
quotient iteration on the test vectors. These inexact eigenmodes have an accuracy of typically 
e Ri 10^^. On average, three out of the 30 modes are excluded by our cutoff choice c = 0.75. 
To compare and verify the inexact results, we also compute the smallest 20 eigenmodes using the 
algorithm described in Sec. 2 with a tolerance of e = 10^^. We refer to them as “exact”. 

Fig. 6 shows the connected (pion) correlator and its relative errors. Due to the spatial vol¬ 
ume averaging, LMA works very well in this case. The connected contribution gives already a 
first indication that our improvement techniques work: We obtain nearly the same errors with ap¬ 
proximate LMA as with exact eigenmodes. For the disconnected contribution, we employ time 
dilution [16, 19, 20] with a distance At = 4a in all cases. In this case the data of both exact and 
approximate LMA agree with the points for which no low-mode averaging has been used as can 
be seen from Fig. 7. The central values show a slightly smoother behaviour. When combining the 
connected and the disconnected data to form the eta-correlator in Fig. 7, the errors increase towards 
larger times and we find that the data do not obey the expected exponential decay. This is probably 
an effect of our limited statistics and insufficient sampling of topological sectors [20]. 

The previous plots show that both exact and inexact LMA reduce the eiTors. Alternatively, 
one could of course also increase the number of stochastic estimates Nstoch so that LMA is not 
necessary. Fig. 8 shows the quadratic error, averaged over the first ten time slices - after that 
point the noise grows rapidly - depending on the number of stochastic solves. This comparison 
shows the efficiency of LMA: In all cases, roughly twice as many inversions are necessary without 
LMA. Approximate and exact LMA are nearly equivalent. Finally, the most interesting question 
is how the methods compare at fixed computafional cost. Fig. 8 shows the actual compute time 
needed for one configuration to obtain a certain error in the eta-correlator. It turns out that using 
the described methods we can reduce the cost of LMA by a factor of roughly ten. This speed-up 
means approximate LMA is cost-efficient and feasible for the use in large-scale computations, even 
if only a small number of different u-point functions needs to be computed. 

4. Conclusions and Outlook 

We made DD-aAMG available for Q by replacing the domain decomposition smoother by 
GMRES and integrated it into a Rayleigh quotient iteration in order to compute the smallest eigen- 
pairs of Q. For moderate numbers of eigenpairs we obtained speed-ups of roughly an order of mag¬ 
nitude over PARPACK. We plan to work on improving the scaling with the number of eigenpairs 
and furthermore to explore possible benefits by incorporating multigrid in other shift-and-invert 
based eigensolvers. 

Moreover, when just using the multigrid setup test vectors, our improved techniques for inac¬ 
curate eigenpairs enable us to get a decent signal for the pion- and eta-correlator that is comparable 
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Figure 6: The connected pseudoscalar (pion) correlator (left) and its relative error at each timeslice (right), 
calculated with exact (red triangles), inexact (orange circles) and without (blue boxes) LMA. 



Figure 7: The disconnected contribution (left) and the full eta-correlator (right), calculated with exact (red 
triangles), inexact (orange circles) and without (blue boxes) LMA, each using Nswch = 20 stochastic esti¬ 
mates. The points are slightly shifted horizontally to improve visibility. 



b 




Figure 8: The average quadratic error of the first ten timeslices of the eta-correlator using exact (red tri¬ 
angles), inexact (orange circles) and no low-mode averaging (blue boxes). The left plot shows how many 
stochastic estimates Nstoch are needed to obtain a certain error (on our limited statistics of 64 configurations). 
The right plot compares the actual cost on SuperMUC at LRZ, taking the time for the calculation of the 
eigenmodes and the stochastic estimation into account. 
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to the result one obtains when using exact eigenpairs, at only a fraction of the cost. Of course 
whether this is possible or whether one needs a few iterations on the eigenvector approximations 
will depend on the observable and also strongly on the volume and pion mass. 

For smaller pion masses when inversions become more expensive and the spectrum is even 
more low-mode dominated the speed-up may be even more pronounced. We will report on this in 
a forthcoming publication. Also, the required number of eigenmodes at different volumes and an 
optimal tuning of the multigrid setup will be investigated, steps that are needed to get an optimal 
speed-up from which we will benefit in future high-statistics measurements. 

This work is partially funded by Deutsche Forschungsgemeinschaft (DFG) within the transregional 
collaborative research centre 55 (SFB-TRR55). All numerical results in Sec. 2.2 were obtained on Juropa at 
Jiilich Supercomputing Centre (JSC) through NIC grant HWU12. We also acknowledge computer time on 
SuperMUC at the Leibniz Rechenzentrum in Garching. 
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